Simple Regenerating Codes: 
Network Coding for Cloud Storage 



Dimitris S. Papailiopoulos†, Jianqiang Luo‡, Alexandros G. Dimakis†, Cheng Huang*, and Jin Li*
†University of Southern California, Los Angeles, CA 90089, Email: {papailio, dimakis}@usc.edu
‡Wayne State University, Detroit, MI 48202, Email: jianqiang@wayne.edu
*Microsoft Research, Redmond, WA 98052, Email: {cheng.huang, jinl}@microsoft.com



Abstract — Network codes designed specifically for distributed 
storage systems have the potential to provide dramatically higher 
storage efficiency for the same availability. One main challenge 
in the design of such codes is the exact repair problem: if a node 
storing encoded information fails, in order to maintain the same 
level of reliability we need to create encoded information at a 
new node. One of the main open problems in this emerging area 
has been the design of simple coding schemes that allow exact 
and low cost repair of failed nodes and have high data rates. In 
particular, all prior known explicit constructions have data rates 
bounded by 1/2. 

In this paper we introduce the first family of distributed 
storage codes that have simple look-up repair and can achieve 
arbitrarily high rates. Our constructions are very simple to 
implement and perform exact repair by simple XORing of 
packets. We experimentally evaluate the proposed codes in a 
realistic cloud storage simulator and show significant benefits in 
both performance and reliability compared to replication and 
standard Reed-Solomon codes. 

I. Introduction 

Distributed storage systems have reached such a massive 
scale that recovery from failures is now part of regular 
operation rather than a rare exception [23]. Large scale
deployments typically need to tolerate multiple failures, both
for high availability and to prevent data loss. Erasure coded 
storage achieves high failure tolerance without requiring a 
large number of replicas that increase the storage cost [9].
Three application contexts where erasure coding techniques 
are being currently deployed or under investigation are Cloud 
storage systems, archival storage, and peer-to-peer storage 
systems like Cleversafe and Wuala (see e.g. [2], [3], [5], [8]).

One central problem in erasure coded distributed storage 
systems is that of maintaining an encoded representation when 
failures occur. To maintain the same redundancy when a 
storage node leaves the system, a newcomer node has to join 
the array, access some existing nodes, and exactly reproduce 
the contents of the departed node. Repairing a node failure in 
an erasure coded system requires in-network combinations of 
coded packets, a concept called network coding. Network coding
has been investigated for numerous applications including
p2p systems, wireless ad hoc networks and various storage
problems (see e.g. [6], [7], [8]).

In this paper we focus on network coding techniques for 
exact repair of a node failure in an erasure coded storage 
system RJ, J2}. There are several metrics that can be optimized 



during repair: the total information read from existing disks
during repair [11], [22], the total information communicated
in the network [14], [16]-[22] (called repair bandwidth [4]),
or the total number of disks required for each repair [8], [13].

Currently, the most well-understood metric is that of repair
bandwidth. For designing (n, k) erasure codes that have n
storage nodes and can tolerate any n - k failures, an information
theoretic tradeoff between the repair bandwidth γ and the
storage per node α was established in [4], using cut-set bounds
on an information flow graph. Explicit code constructions
exist for the two extreme points on this bandwidth-storage
tradeoff, see e.g. [2], [5]. Despite this substantial amount
of prior work, there are no practical code constructions of 
efficiently repairable codes with data rates above 1/2. Further, 
different performance metrics might be of interest in different 
applications. It seems that for cloud storage applications the 
main performance bottleneck is the disk I/O overhead for 
repair, which is proportional to the number of nodes d involved 
in rebuilding a failed node. 

Our Contribution: In this paper we introduce the first 
family of distributed storage codes that have simple look-up 
repair and can achieve arbitrarily high rates. Our constructions 
are very simple to implement and perform exact repair by 
simple packet combinations. Specifically, we design simple 
regenerating codes (SRC) that have high-rate, very small disk- 
I/O d, and minimal repair computation. 

An (n, k, f)-SRC is a code for n storage nodes that can tolerate n - k erasures, where each node stores a fraction $\frac{f+1}{fk}$ of the file size in coded chunks. To repair a single coded chunk we need to access f disks and read 1 chunk from each disk. The regeneration of an entire lost node costs a fraction $\frac{f+1}{k}$ of the file size in repair bandwidth and d = 2f disk accesses. Our codes have rate $R = \frac{f}{f+1} \cdot \frac{k}{n}$, which can be made arbitrarily close to $\frac{f}{f+1}$, for erasure resiliency that is constant in k.
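
The parameters listed above can be tabulated with a few lines of Python. This is only an illustrative sketch: the function name `src_parameters` and the dictionary keys are hypothetical, all quantities are expressed as fractions of the file size, and the disk-access count uses the min(2f, n - 1) form that is proved later in the paper.

```python
from fractions import Fraction


def src_parameters(n: int, k: int, f: int) -> dict:
    """Storage, repair cost and rate of an (n, k, f)-SRC.

    Storage and bandwidth are fractions of the file size M.
    """
    if not (0 < k <= n and 0 < f < n):
        raise ValueError("need 0 < k <= n and 0 < f < n")
    return {
        # Each node holds f + 1 chunks out of a file of f * k chunks.
        "storage_per_node": Fraction(f + 1, f * k),
        # One lost chunk is rebuilt from f chunks on f distinct disks.
        "chunk_repair_disks": f,
        # A whole node is rebuilt from (f + 1) * f chunks.
        "node_repair_bandwidth": Fraction(f + 1, k),
        "node_repair_disks": min(2 * f, n - 1),
        "rate": Fraction(f * k, (f + 1) * n),
    }


params = src_parameters(4, 2, 2)
print(params["rate"], params["node_repair_disks"])
```

For the (4, 2, 2) example the rate is 1/3, and a node repair touches all 3 surviving nodes because 2f exceeds n - 1.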

We experimentally evaluate the proposed codes in a realistic 
cloud storage simulator that models node rebuilds in Hadoop. 
Our simulator was initially validated on a real Hadoop system
of 16 machines connected by a 1GB/s network. Our
subsequent experiment involves 100 machines and compares 
the performance of SRC to replication and standard Reed- 
Solomon codes. We find that SRCs add a new attractive point 
in the design space of redundancy mechanisms for cloud 
storage. 
Fig. 1. Example of a (4, 2, 2)-SRC. n = 4 storage nodes, any k = 2 recover the data and XORs of degree f = 2 provide simple repair.



II. Simple Regenerating Codes 

The first requirement from our storage code is the (n, k) 
property: a code will be storing information in n storage nodes 
and should be able to tolerate any combination of n - k
failures without data loss. We refer to codes that have this
reliability as "(n, k) erasure codes" or codes that have "the
(n, k) property".

One well-known class of erasure codes that have this 
property is the family of maximum distance separable (MDS) 
codes [5], [20]. In short, an MDS code is a way to take a
data object of size M, split it into chunks of size M/k and 
create n chunks of the same size that have the (n, k) property. 
It can be seen that MDS codes achieve the (n, k) property 
with the minimum storage overhead possible: any k storage 
nodes jointly store M bits of useful information, which is the 
minimum possible to guarantee recovery. 

Our second requirement is efficient exact repair [5]. When
one node fails or becomes unavailable, the stored information 
should be easily reconstructable using other surviving nodes. 
Simple regenerating codes achieve the (n, k) property and 
simple repair simultaneously by separating the two problems. 
Large MDS codes are used to provide reliability against any 
n — k failures while very simple XORs applied over the MDS 
coded packets provide efficient exact repair when single node 
failures happen. 
Fig. 2. File reconstruction of a (4, 2, 2)-SRC.

We give a first overview of our construction through a simple example in Fig. 1, which shows an (n = 4, k = 2, f = 2)-SRC. The original data object is split in 4 chunks $f_1, f_2, f_3, f_4$. We first encode $[f_1 \; f_2]$ in $[x_1 \; x_2 \; x_3 \; x_4]$ and $[f_3 \; f_4]$ in $[y_1 \; y_2 \; y_3 \; y_4]$ using any standard (4, 2) MDS code. This can be easily done by multiplication of the data with the 2 × 4 generator matrix G of the MDS code to form $[x_1 \; x_2 \; x_3 \; x_4] = [f_1 \; f_2]G$ and $[y_1 \; y_2 \; y_3 \; y_4] = [f_3 \; f_4]G$. Then we generate a parity out of each "level" of coded chunks, i.e., $s_i = x_i + y_i$, which results in an aggregate of 12 chunks. We circularly place these chunks in 4 nodes, each storing 3, as shown in Fig. 1.

It is easy to check that this code has the (n, k) property and
in Fig. 2 we show an example by failing nodes 1 and 4. Any
two nodes contain two $x_i$ and two $y_i$ chunks which through
the outer MDS codes can be used to recover the original data
object. We note that the parity chunks are not used in this
process, which shows the sub-optimality of our construction.
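
The encoding and file reconstruction of the (4, 2, 2) example can be sketched as follows. Everything specific here is an assumption made for illustration: chunks are single bytes, the field is GF(2^8) with the reduction polynomial x^8 + x^4 + x^3 + x + 1 (so addition is XOR), and the generator matrix has columns (1, α_j) with the hypothetical choice α = (1, 2, 3, 4). Any other (4, 2) MDS code would serve equally well.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse in GF(2^8) by exhaustive search (a != 0)."""
    return next(c for c in range(1, 256) if gf_mul(a, c) == 1)


# Hypothetical generator matrix G: column j is (1, ALPHA[j]) with distinct
# ALPHA[j], so any two columns are linearly independent (MDS for k = 2).
ALPHA = [1, 2, 3, 4]
N = 4


def mds_encode(a: int, b: int) -> list:
    """[a b] G: coded symbol j is a + b * ALPHA[j], where + is XOR."""
    return [a ^ gf_mul(b, alpha) for alpha in ALPHA]


def mds_decode(j1: int, v1: int, j2: int, v2: int) -> tuple:
    """Recover (a, b) from coded symbols at 0-based positions j1 != j2."""
    b = gf_mul(v1 ^ v2, gf_inv(ALPHA[j1] ^ ALPHA[j2]))
    return v1 ^ gf_mul(b, ALPHA[j1]), b


def src_encode(f1: int, f2: int, f3: int, f4: int) -> list:
    """Return the 4 nodes; node i (0-based) holds x_i, y_{i+1}, s_{i+2}."""
    x, y = mds_encode(f1, f2), mds_encode(f3, f4)
    s = [xi ^ yi for xi, yi in zip(x, y)]
    return [
        {"x": (i, x[i]),
         "y": ((i + 1) % N, y[(i + 1) % N]),
         "s": ((i + 2) % N, s[(i + 2) % N])}
        for i in range(N)
    ]


def reconstruct(node_a: dict, node_b: dict) -> tuple:
    """Rebuild (f1, f2, f3, f4) from any two nodes, ignoring the parities."""
    f1, f2 = mds_decode(*node_a["x"], *node_b["x"])
    f3, f4 = mds_decode(*node_a["y"], *node_b["y"])
    return f1, f2, f3, f4


nodes = src_encode(10, 20, 30, 40)
# Fail nodes 1 and 4 and read the two survivors, nodes 2 and 3.
recovered = reconstruct(nodes[1], nodes[2])
print(recovered)
```

The parity chunks are stored but never read by `reconstruct`, which mirrors the observation that they play no role in file recovery.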
Fig. 3. The repair of node 1 in a (4, 2, 2)-SRC.

In Fig. 3 we give an example of a single node repair
of the (4, 2, 2)-SRC. We assume that node 1 is lost and a
newcomer joins the system. To reconstruct $x_1$, the newcomer
has to download $y_1$ and $s_1$ from nodes 3 and 4. This simple
repair scheme is possible due to the way that we placed the 
chunks in the 4 storage nodes: each node stores 3 chunks with 



different index. The newcomer reconstructs each lost chunk by 
downloading, accessing, and XORing 2 other chunks. In this 
process the outer MDS codes are not used. 

In short our codes combine outer MDS codes and sim- 
ple parities to provide fault tolerance and efficient repair 
respectively. Due to this separation of duties, our codes are 
suboptimal. However, as we show subsequently this optimality 
loss corresponds to asymptotically negligible loss in storage 
efficiency and only a logarithmic factor overhead compared to 
the optimal information theoretic storage bounds. 



A. The f = 2 Case: degree 2 parities

We now present our general SRC construction for the f = 2 case.

B. Code Construction, Erasure Resiliency, and Rate

Let $\mathbf{f}$ be a file of size M = 2k that we cut into 2 parts, say

$$\mathbf{f} = \left[\mathbf{f}^{(1)} \;\; \mathbf{f}^{(2)}\right], \tag{1}$$

where $\mathbf{f}^{(i)} \in \mathbb{F}^{1 \times k}$ for $i \in [2]$, $[N] = \{1, \dots, N\}$, and $\mathbb{F}$ is the finite field over which all operations are performed. Our coding process is a two-step one: first we independently encode each of the file parts using an outer MDS code and generate a simple parity sum out of them. Then we store the coded chunks and the parity sum chunks in a specific way in n storage components. This encode-and-place scheme enables easy repair of lost chunks and arbitrary erasure tolerance.

We start with an (n, k) MDS code that we use to encode independently each of the 2 file parts of size k, $\mathbf{f}^{(1)}$ and $\mathbf{f}^{(2)}$, into two coded vectors, $\mathbf{x}$ and $\mathbf{y}$, of length n. This encoding process is given by

$$\mathbf{x} = \mathbf{f}^{(1)} \mathbf{G} \quad \text{and} \quad \mathbf{y} = \mathbf{f}^{(2)} \mathbf{G}, \tag{2}$$

where $\mathbf{G} \in \mathbb{F}^{k \times n}$ is the outer MDS code generator matrix. We pose no requirements on that MDS code, in the sense that any (n, k) MDS design will work for our purposes. The maximum distance of the code ensures that any k encoded chunks of $\mathbf{x}$ can reconstruct $\mathbf{f}^{(1)}$; the same goes for any k chunks from $\mathbf{y}$, i.e., we can use them to reconstruct $\mathbf{f}^{(2)}$. We continue by generating a parity sum vector by adding the two coded vectors $\mathbf{x}$ and $\mathbf{y}$

$$\mathbf{s} = \mathbf{x} + \mathbf{y} = [s_1 \; \dots \; s_n], \tag{3}$$

where $s_i = x_i + y_i$. We note that the index i of the parity sum $s_i$ is the same as the subscript of the 2 coded chunks that generate it. This process yields 3n chunks: 2n coded chunks in the vectors $\mathbf{x}$ and $\mathbf{y}$, and n parity sum chunks, i.e., the vector $\mathbf{s} = \mathbf{x} + \mathbf{y}$.

We proceed by placing these 3n chunks in n storage nodes in the following way: each storage node will be storing 3 chunks, one from $\mathbf{x}$, one from $\mathbf{y}$, and one from the parity vector $\mathbf{s}$. We require that these 3 chunks do not share a subscript. This subscript requirement can be guaranteed by the following circular placement of chunks in the i-th node

$$\text{node}(i) = \begin{bmatrix} x_i \\ y_{i \oplus 1} \\ s_{i \oplus 2} \end{bmatrix}, \tag{4}$$

where $i \in [n]$ and $\oplus$ denotes modulus addition on the ring $\{1, \dots, n\}$ (for example $n \oplus 1 = 1$). The above circular chunk placement results in the following coded array of n storage nodes

$$\begin{array}{c|c|c|c}
\text{node } 1 & \text{node } 2 & \cdots & \text{node } n \\ \hline
x_1 & x_2 & \cdots & x_n \\
y_2 & y_3 & \cdots & y_1 \\
s_3 & s_4 & \cdots & s_2
\end{array}$$

We can observe that for n > 2, indeed the 3 chunks of each node do not share a subscript.
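
The circular placement and the subscript condition can be checked mechanically. The sketch below is illustrative only; `ring_add` and `place_f2` are hypothetical helper names, and only the subscripts of the stored chunks are tracked, not their contents.

```python
def ring_add(i: int, j: int, n: int) -> int:
    """Addition on the ring {1, ..., n}, so that ring_add(n, 1, n) == 1."""
    return (i + j - 1) % n + 1


def place_f2(n: int) -> dict:
    """Map node i to the subscripts of its (x, y, s) chunks."""
    return {i: (i, ring_add(i, 1, n), ring_add(i, 2, n))
            for i in range(1, n + 1)}


def subscripts_distinct(n: int) -> bool:
    """True if no node stores two chunks with the same subscript."""
    return all(len(set(subs)) == 3 for subs in place_f2(n).values())


print(place_f2(4))
print(subscripts_distinct(4), subscripts_distinct(2))
```

For n = 4 this reproduces the layout of the running example, while for n = 2 the parity wraps around onto the subscript of the x chunk in the same node, which is why n > 2 is needed.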

C. Erasure Resiliency and Effective Coding Rate 

In this section, we present the erasure resiliency and coding 
rate of the (n, k, 2)-SRC and prove the following theorem. 
Due to lack of space we do not present some proofs in full 
length and we give sketches instead. The extended version of 
the paper with full proofs can be found online at [1].

Theorem 1: The (n, k, 2)-SRC can tolerate any possible combination of n - k erasures and has effective coding rate $\frac{2}{3} \cdot \frac{k}{n}$.

Proof Sketch: The (n, k) property of the SRC is inherited from the underlying MDS outer codes: we can always retrieve the file by connecting to any subset of k nodes of the storage array. Any subset of k nodes contains k chunks of each of the two file parts $\mathbf{f}^{(1)}$ and $\mathbf{f}^{(2)}$, which can be retrieved by inverting the corresponding k × k submatrices of the MDS generator matrix $\mathbf{G}$. Hence, the (n, k) property of the two identical outer MDS pre-codes gives the (n, k, 2)-SRC its (n, k) property.

We proceed by calculating the coding rate (space efficiency) R of the (n, k, 2)-SRC, by considering the ratio of the total amount of useful stored information to the total amount of data that is stored. That is, the ratio of the initial file size to the expended storage

$$R = \frac{\text{file size}}{\text{storage spent}} = \frac{2 \cdot k}{3 \cdot n}. \tag{5}$$

□



Hence, the (n, k, 2)-SRC is an erasure code with rate upper bounded by $\frac{2}{3}$: for fixed erasure tolerance, n - k = m, the SRC can have rate arbitrarily close to $\frac{2}{3}$, that is,

$$R = \frac{2}{3} \cdot \frac{k}{k+m} \xrightarrow{k \to \infty} \frac{2}{3}. \tag{6}$$
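
As a worked illustration of how the rate approaches its bound, take a fixed erasure tolerance of m = 4 (the values of k below are chosen only for illustration):

```latex
\begin{align*}
k = 16:\quad & R = \frac{2}{3} \cdot \frac{16}{20} = \frac{8}{15} \approx 0.533, \\
k = 46:\quad & R = \frac{2}{3} \cdot \frac{46}{50} = \frac{46}{75} \approx 0.613, \\
k \to \infty:\quad & R \to \frac{2}{3} \approx 0.667.
\end{align*}
```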



The (n, k, 2)-SRC construction that is presented in this
section can be generalized to constructions where the rate can
be made arbitrarily high. This is done by increasing the number
of chunks stored per node and the degree of the parity sums
from 2 to f. These constructions are presented in Section III.



D. Repairing Lost Chunks 

For the general (n, k, 2)-SRC, when a single node is lost, 
or a single chunk of that lost node is requested to be accessed, 
the repair process is initiated. To sustain high data availability 
in the presence of chunk and node erasures, the repair process 
has to be fast and simple: it should be low cost with respect 
to information read, communicated, and with respect to the 
number of total disk accesses. The circular placement of 
chunks in the SRC enables easy repair of single lost chunks, 
or single node failures, with respect to the aforementioned 
metrics. This is due to the fact that each chunk that is lost 
shares an index with 2 more chunks stored in 2 distinct nodes. 
By contacting these 2 remaining nodes, we can repair the lost 
chunk by a simple XOR operation. For the repair of a single 
chunk or a single node, we have the following theorem. 

Theorem 2: The repair of a single chunk of the (n, k, 2)-SRC costs 2 in repair bandwidth and chunk reads, that is a fraction $\frac{1}{k}$ of the file size, and 2 disk accesses. Moreover, the repair of a single node failure costs 6 in repair bandwidth and chunk reads, that is a fraction $\frac{3}{k}$ of the file size, and 4 in disk accesses.

Proof: Let for example node $i \in [n]$ fail, that is, chunks $x_i$, $y_{i \oplus 1}$, and $s_{i \oplus 2}$ are lost. Then, a newcomer joins the storage array and wishes to regenerate the lost information. To reconstruct $x_i$, the newcomer connects to the two chunks available in the storage system that share the same subscript i, i.e., it connects to the node that contains the parity $s_i$ and to the node that contains the chunk $y_i$. The newcomer can then restore the lost chunk $x_i$ simply by subtracting $y_i$ from the parity $s_i$. This repair process is summarized in the following 3 steps.



  Step   Repair chunk $x_i$:
  1      Access Disk $i \ominus 1$ and download $y_i$
  2      Access Disk $i \ominus 2$ and download $s_i$
  3      restore $x_i := s_i - y_i$



where $\ominus$ is subtraction on the ring $\{1, \dots, n\}$ (for example $1 \ominus 1 = n$). We follow the same manner to repair $y_{i \oplus 1}$:



  Step   Repair chunk $y_{i \oplus 1}$:
  1      Access Disk $i \oplus 1$ and download $x_{i \oplus 1}$
  2      Access Disk $i \ominus 1$ and download $s_{i \oplus 1}$
  3      restore $y_{i \oplus 1} := s_{i \oplus 1} - x_{i \oplus 1}$



The parity repair is also similar: we need to access the 2 nodes
that contain the coded chunks $x_{i \oplus 2}$ and $y_{i \oplus 2}$ and sum them:



  Step   Repair chunk $s_{i \oplus 2}$:
  1      Access Disk $i \oplus 2$ and download $x_{i \oplus 2}$
  2      Access Disk $i \oplus 1$ and download $y_{i \oplus 2}$
  3      restore $s_{i \oplus 2} := x_{i \oplus 2} + y_{i \oplus 2}$



From the above, we observe that the repair of a single chunk contained in a storage node requires 2 disk accesses, 2 chunk reads, and 2 downloads. Moreover, to repair a single node failure an aggregate of 6 chunk reads and 6 downloads is required. The set of disks that are accessed to repair all chunks of node i is $\{i \ominus 2, i \ominus 1, i \oplus 1, i \oplus 2\}$, for $i \in [n]$. Hence, the number of disk accesses is min(n - 1, 4), and n - 1 is true when $i \ominus 2 = i \oplus 2$, as is the case in our (4, 2, 2) example in Figures 1-3.

□
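
The three repair tables can be exercised end to end with a small sketch. It assumes a field of characteristic 2, so that both the addition and the subtraction in the tables become XOR, and it uses random bytes in place of real MDS-coded chunks, since the repair never touches the outer code. The helper names `ring`, `build_array` and `repair_node` are hypothetical.

```python
import random


def ring(i: int, n: int) -> int:
    """Wrap an integer onto the ring {1, ..., n}."""
    return (i - 1) % n + 1


def build_array(x: dict, y: dict, n: int) -> dict:
    """Node i stores x_i, y_{i+1} and s_{i+2}, where s_j = x_j XOR y_j."""
    s = {j: x[j] ^ y[j] for j in range(1, n + 1)}
    return {
        i: {("x", i): x[i],
            ("y", ring(i + 1, n)): y[ring(i + 1, n)],
            ("s", ring(i + 2, n)): s[ring(i + 2, n)]}
        for i in range(1, n + 1)
    }


def repair_node(array: dict, i: int, n: int) -> tuple:
    """Rebuild node i from the survivors; return (chunks, disks accessed)."""
    accessed = set()

    def fetch(node: int, key: tuple) -> int:
        if node == i:
            raise RuntimeError("repair needs a chunk from the failed node")
        accessed.add(node)
        return array[node][key]

    a, b, c = i, ring(i + 1, n), ring(i + 2, n)
    rebuilt = {
        # x_i = s_i - y_i, read from disks i-2 and i-1.
        ("x", a): fetch(ring(a - 2, n), ("s", a)) ^ fetch(ring(a - 1, n), ("y", a)),
        # y_{i+1} = s_{i+1} - x_{i+1}, read from disks i-1 and i+1.
        ("y", b): fetch(ring(b - 2, n), ("s", b)) ^ fetch(b, ("x", b)),
        # s_{i+2} = x_{i+2} + y_{i+2}, read from disks i+2 and i+1.
        ("s", c): fetch(c, ("x", c)) ^ fetch(ring(c - 1, n), ("y", c)),
    }
    return rebuilt, accessed


n = 4
x = {j: random.randrange(256) for j in range(1, n + 1)}
y = {j: random.randrange(256) for j in range(1, n + 1)}
array = build_array(x, y, n)
rebuilt, accessed = repair_node(array, 1, n)
print(rebuilt == array[1], sorted(accessed))
```

With n = 4 the repair of node 1 touches disks 2, 3 and 4, that is n - 1 disks, while for larger n the same routine touches 4 disks.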



Remark 1: We would like to note that a repair would fail,
i.e., one of the packets that are used to regenerate lost
information cannot be retrieved, only if n ≤ 2.

In the following section, we introduce the general code
construction of the (n, k, f)-SRC, where we consider its rate,
reliability, repair properties, and analyze its asymptotics.

III. SRC: The General Construction 

In this section we generalize the f = 2 construction to the (n, k, f)-SRC. For the general (n, k, f)-SRC, we use f parallel and identical MDS outer pre-codes and generate a single parity vector from f encoded parts. We circularly place the generated chunks in n storage nodes. The (n, k, f)-SRC is an (n, k) erasure code with rate $R = \frac{f}{f+1} \cdot \frac{k}{n}$, i.e., the SRC always attains a fraction $\frac{f}{f+1}$ of the space efficiency of an (n, k) MDS code, for the same reliability, but with simple and low cost node repair. We perform single node repairs in the same manner as the f = 2 case: to repair a chunk, we access f nodes and perform a simple addition. For any f, the communication overhead to repair a single chunk is a fraction $\frac{1}{k}$ of the file size and the number of chunk reads and disk accesses is f, which can be constant and not necessarily a function of k. The repair of a single node failure costs $(f+1)\frac{M}{k}$ in repair bandwidth and we prove that the total number of disk accesses needed for a single node failure is exactly 2f. We proceed by introducing the general code construction and showing its properties.

A. Encoding, Erasure Resilience, and Rate 

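Let $\mathbf{f}$ be a file of size M = fk that is subpacketized in f parts,

$$\mathbf{f} = \left[\mathbf{f}^{(1)} \; \dots \; \mathbf{f}^{(f)}\right], \tag{7}$$

with each $\mathbf{f}^{(i)}$, $i \in [f]$, having size k. We encode each of the f file parts independently, into vectors $\mathbf{x}^{(i)}$ of length n, using an outer (n, k) MDS code. That is, we have

$$\mathbf{x}^{(1)} = \mathbf{f}^{(1)} \mathbf{G}, \; \dots, \; \mathbf{x}^{(f)} = \mathbf{f}^{(f)} \mathbf{G}, \tag{8}$$

where $\mathbf{G}$ is the k × n MDS generator matrix.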

Remark 2: The outer MDS code can be any scalar or array 
(n, k) MDS code, i.e., we pose no requirements on its design 
or finite field size. 

We generate a single parity sum vector from all the coded vectors

$$\mathbf{s} = \sum_{i=1}^{f} \mathbf{x}^{(i)}. \tag{9}$$

This process yields a total of fn coded chunks in the $\mathbf{x}^{(i)}$ vectors and n parity chunks in $\mathbf{s}$, i.e., we have an aggregate of (f + 1)n chunks available to place in n nodes.



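We will circularly place these (f + 1)n chunks in n storage nodes, with each node storing f coded chunks and 1 parity sum chunk, hence each node expends

$$\alpha_{SRC} = f + 1 = \frac{f+1}{f} \cdot \frac{M}{k} \tag{10}$$

in storage capacity. The placement will again obey the property that enables easy repair: no two chunks within a storage node should share the same subscript. To ensure successful repair we also require that f < n. Below we state the circular placement of chunks in the i-th node, for $i \in [n]$

$$\text{node}(i) = \begin{bmatrix} x^{(1)}_{i} \\ x^{(2)}_{i \oplus 1} \\ \vdots \\ x^{(f)}_{i \oplus (f-1)} \\ s_{i \oplus f} \end{bmatrix}, \tag{11}$$

which results in the following array of n storage nodes

$$\begin{array}{c|c|c|c}
\text{node } 1 & \text{node } 2 & \cdots & \text{node } n \\ \hline
x^{(1)}_{1} & x^{(1)}_{2} & \cdots & x^{(1)}_{n} \\
x^{(2)}_{2} & x^{(2)}_{3} & \cdots & x^{(2)}_{1} \\
\vdots & \vdots & & \vdots \\
x^{(f)}_{f} & x^{(f)}_{f+1} & \cdots & x^{(f)}_{f-1} \\
s_{f+1} & s_{f+2} & \cdots & s_{f}
\end{array}$$

Then, we have the following theorem.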

Theorem 3: The (n, k, f)-SRC can tolerate any combination of n - k node erasures and has coding rate $\frac{f}{f+1} \cdot \frac{k}{n}$.

Proof Sketch: The f MDS pre-codes guarantee perfect file reconstruction after any n - k erasures. The file can always be reconstructed by connecting to any k nodes: any collection of k nodes contains fk distinct coded chunks, k of each file part. Each of these k-tuples of coded chunks can give back the information chunks of a single file part due to the f outer MDS codes.

The effective coding rate of the (n, k, f)-SRC is equal to the ratio of the initial file size to the expended storage, that is

$$R_{SRC} = \frac{\text{file size}}{\text{storage spent}} = \frac{f \cdot k}{(f+1) \cdot n}. \tag{12}$$

□



By the above theorem we can claim that the rate of the SRC is a fraction $\frac{f}{f+1}$ of the coding rate of an (n, k) MDS code, hence is upper bounded by

$$R_{SRC} = \frac{f}{f+1} \cdot \frac{k}{n} \le \frac{f}{f+1}. \tag{13}$$

□



In Fig. 4 we show how the effective coding rate of a
(20, 16, f)-SRC scales as a function of f, and compare it
with that of a (20, 16) MDS code. Both codes can tolerate 4
failures. We observe that as f increases the coding rate of the
SRC approaches that of the MDS code.
Fig. 4. Rate comparison of a (20, 16, f)-SRC and a (20, 16)-MDS Code.
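
The trend in the rate comparison can be recomputed from the rate formula alone. The following is a minimal sketch, assuming f ranges over 1, ..., 19 (the construction requires f < n); `src_rate` and `mds_rate` are hypothetical helper names, and the numbers are evaluated from the formula rather than read off the figure.

```python
from fractions import Fraction


def src_rate(n: int, k: int, f: int) -> Fraction:
    """Effective coding rate f/(f+1) * k/n of an (n, k, f)-SRC."""
    return Fraction(f, f + 1) * Fraction(k, n)


def mds_rate(n: int, k: int) -> Fraction:
    """Coding rate k/n of an (n, k) MDS code."""
    return Fraction(k, n)


rates = {f: src_rate(20, 16, f) for f in range(1, 20)}
for f in (1, 2, 4, 8, 16):
    print(f, float(rates[f]), float(mds_rate(20, 16)))
```

The rate grows monotonically with f and stays below the MDS rate of 4/5, at the price of f disk accesses per chunk repair.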

B. Repairing Lost Elements 

In this subsection we prove the repair properties of the SRC, 
which are summarized in the following theorem. 

Theorem 4: The repair of a single chunk of the (n, k, f)-SRC, where each node stores $\alpha_{SRC} = \frac{f+1}{f} \cdot \frac{M}{k}$, costs $\frac{M}{k}$ in repair bandwidth and f in chunk reads and disk accesses. The repair of a single node failure costs

$$\gamma_{SRC} = (f+1) \frac{M}{k} \tag{14}$$

in repair bandwidth, $f(f+1) = (f+1)\frac{M}{k}$ in chunk reads, and

$$d_{SRC} = \min(2f, n-1) \tag{15}$$

in disk accesses.



Proof: Let node $i \in [n]$ fail. A newcomer node can reconstruct the lost chunk $x^{(l)}_{i \oplus (l-1)}$ by accessing all f nodes in the set

$$S_i(l) = \{\, i \ominus (f-l+1),\; i \ominus (f-l),\; \dots,\; i \oplus (l-1) \,\} \setminus \{i\} \tag{16}$$

and downloading the chunk of each node that has the same subscript $i \oplus (l-1)$ as the lost chunk. For example, to reconstruct $x^{(1)}_i$ we need to perform the following steps:

  Step   Repair chunk $x^{(1)}_i$:
  1      Access Disk $i \ominus 1$ and download $x^{(2)}_i$
  2      Access Disk $i \ominus 2$ and download $x^{(3)}_i$
  ...
  f-1    Access Disk $i \ominus (f-1)$ and download $x^{(f)}_i$
  f      Access Disk $i \ominus f$ and download $s_i$
  f+1    restore $x^{(1)}_i := s_i - \sum_{l=2}^{f} x^{(l)}_i$



Hence, repairing a single coded chunk requires $f = \frac{M}{k}$ chunk downloads and reads, and f disk accesses. To reconstruct the parity sum chunk $s_{i \oplus f}$, we need to connect to the f nodes that contain the chunks $x^{(l)}_{i \oplus f}$, $l \in [f]$, which generate it.

Fig. 5. (n, k, f)-SRC Performance Comparison



To repair a single node failure we need to communicate and read $(f+1)f = (f+1)\frac{M}{k}$ symbols. The total number of disk accesses for a single node repair is given by the number of distinct indices in the set

$$S_i = \bigcup_{l=1}^{f+1} S_i(l). \tag{17}$$

To enumerate the distinct indices in $S_i$, we first count the number of distinct indices between sets $S_i(l)$ and $S_i(l+1)$ for all $l \in [f]$. We observe that

$$S_i(l) \cup S_i(l+1) = \{\, i \ominus (f-l+1),\; i \ominus (f-l),\; \dots,\; i \oplus l \,\} \setminus \{i\}$$

and

$$|S_i(l) \cup S_i(l+1)| = f + 1, \tag{18}$$

that is, for any two "consecutive" chunk repairs, we need to access f + 1 storage nodes. Starting with f disk accesses for the first chunk repair, each additional chunk repair requires an additional disk access, with respect to what has already been accessed. The total number of disks accessed is

$$\begin{aligned}
d_{SRC} &= (\#\text{ of disk accesses for chunk } x^{(1)}_{i}) \\
&\quad + (\#\text{ of additional disk accesses for chunk } x^{(2)}_{i \oplus 1}) \\
&\quad + \dots \\
&\quad + (\#\text{ of additional disk accesses for chunk } s_{i \oplus f}) \\
&= f + \underbrace{1 + 1 + \dots + 1}_{f \text{ additional disk accesses}} = 2f.
\end{aligned} \tag{19}$$

Therefore, to repair a single node failure an aggregate of 2f disk accesses is required, when 2f ≤ n - 1. If 2f > n - 1 then the number of total disk accesses is n - 1. □
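
The disk-access count can also be confirmed by brute force, without relying on the closed form of the repair sets. The sketch below only tracks which subscripts each node holds under the circular placement; the helper names are hypothetical and f < n is assumed throughout.

```python
def ring(i: int, n: int) -> int:
    """Wrap an integer onto the ring {1, ..., n}."""
    return (i - 1) % n + 1


def node_contents(i: int, n: int, f: int) -> list:
    """Chunks of node i as (kind, part, subscript); part 0 marks the parity."""
    chunks = [("x", l, ring(i + l - 1, n)) for l in range(1, f + 1)]
    chunks.append(("s", 0, ring(i + f, n)))
    return chunks


def repair_set(i: int, lost: tuple, n: int, f: int) -> set:
    """Nodes other than i holding a chunk with the subscript of `lost`."""
    subscript = lost[2]
    return {
        j for j in range(1, n + 1)
        if j != i and any(c[2] == subscript for c in node_contents(j, n, f))
    }


def disks_for_node_repair(i: int, n: int, f: int) -> int:
    """Number of distinct disks touched when every chunk of node i is rebuilt."""
    accessed = set()
    for lost in node_contents(i, n, f):
        helpers = repair_set(i, lost, n, f)
        assert len(helpers) == f  # each chunk is rebuilt from exactly f disks
        accessed |= helpers
    return len(accessed)


print(disks_for_node_repair(1, 20, 2), disks_for_node_repair(1, 4, 2))
```

For (n, f) = (20, 2) this gives 4 = 2f disks, and for (4, 2) it gives 3 = n - 1, in line with the min(2f, n - 1) expression.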

In Fig. 5 we give a comparison table between MDS, MSR, MBR, and Simple Regenerating Codes, with respect to 1) storage capacity per node α, 2) repair bandwidth per single node repair γ, 3) number of disk accesses per single node repair d, and 4) effective coding rate R. We consider MSR and MBR codes that connect to $d \in \{k, n-1\}$ remaining nodes for a single node failure. Observe that the number of disk accesses in the SRC is a design parameter that can be set to a constant by appropriately choosing f, which can be orders of magnitude smaller than k.

Remark 3: Regenerating Codes [4] have the property that a single node failure can be repaired by any subset of d remaining nodes, and $k \le d \le n-1$ is fixed by the specific code design. In sharp contrast, SRCs are look-up repair codes: for a single node failure, only a specific subset of $d_{SRC}$ of the remaining nodes can reconstruct the file and $d_{SRC}$ can be a constant, or a function of k that potentially grows much slower than $\Theta(k)$.

C. Asymptotics of the SRC and links to MDS codes 

In this subsection, we consider the asymptotics of the SRC. What happens if we fix $R = \frac{k}{n}$ and let the degree of parities f grow as a function of k? Let for example

$$f = \log(k). \tag{20}$$

Then, the repair of a single node costs $\gamma_{SRC} = (\log(k)+1) M/k$, with $d_{SRC} = 2f = 2\log(k)$. In comparison, a single node failure of an (n, k) MSR code costs $\gamma_{MSR} = \frac{n-1}{n-k} M/k$. If we let k and n grow and fix $R = \frac{k}{n}$ we obtain

$$\frac{\gamma_{SRC}}{\gamma_{MSR}} = \frac{\log(k)+1}{\frac{n-1}{n-k}} = \frac{\log(k)+1}{\frac{(1/R)k - 1}{(1/R - 1)k}} = \Theta(\log(k)). \tag{21}$$

The effective coding rate of the SRC is given by

$$\frac{f}{f+1} \cdot \frac{k}{n} = \frac{\log(k)}{\log(k)+1} \cdot \frac{k}{n} \xrightarrow{k \to \infty} R. \tag{22}$$

Therefore, compared to repair optimal MDS codes, i.e. MSR codes, SRCs with f = log(k) sacrifice asymptotically negligible coding rate and have a logarithmic overhead compared to minimum bandwidth node repair, while at the same time they attain very easy repair based on simple XORs, with a number of disk accesses that is logarithmic in k.
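
A quick numeric check makes the two asymptotic statements concrete. This sketch makes assumptions that the text does not fix: the logarithm is taken to be natural, f is rounded up to an integer, and n is set to round(k / R). The function name `asymptotics` is hypothetical.

```python
import math


def asymptotics(k: int, R: float) -> tuple:
    """Return (f, gamma_SRC / gamma_MSR, SRC rate) for f = ceil(ln k)."""
    n = round(k / R)
    f = max(1, math.ceil(math.log(k)))
    # gamma_SRC = (f + 1) M/k and gamma_MSR = (n - 1)/(n - k) M/k.
    ratio = (f + 1) * (n - k) / (n - 1)
    rate = f / (f + 1) * k / n
    return f, ratio, rate


for k in (100, 10_000, 1_000_000):
    f, ratio, rate = asymptotics(k, 0.5)
    print(k, f, round(ratio, 3), round(rate, 4))
```

Under these assumptions the bandwidth ratio grows like (1 - R)(f + 1), that is logarithmically in k, while the SRC rate climbs toward R = 0.5.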

IV. Simulations 

In addition to our theoretical analysis, we evaluate SRCs
in a realistic cloud storage simulator. We only tested SRCs
with f = 2 in this paper. This case allows the most efficient
repair but at somewhat high storage overhead. We leave the
exploration of other choices of f and the involved tradeoffs
as future work.

A. Simulator Introduction 

We first present the architecture of the cloud storage system 
that our simulator is modeling. The architecture contains one 
master server and a great number of data storage servers, 
similar to that of GFS [23] and Hadoop [24]. As a cloud
storage system may store up to tens of petabytes of data, we 
expect numerous failures and hence fault tolerance and high 
availability are critical. To offer high data reliability, the master 
server needs to monitor the health status of each storage server 
and detect failures promptly. 



In the systems of interest, data is partitioned and stored as a 
number of fixed-size chunks, which in Hadoop can be 64MB 
or 128MB. Chunks form the smallest accessible data units 
and in our system are set to be 64MB. To tolerate storage 
server failures, replication or erasure codes are employed to 
generate redundant chunks. Then, several chunks are grouped 



and form a redundancy set [25]. If one chunk is lost, it can



be reconstructed from other surviving chunks. To repair the 
chunks due to a failure event, the master server will initiate 
the repair process and schedule repair jobs. 

We implemented a discrete-event simulator of a cloud 
storage system using a similar architecture and data repair 
mechanism as Hadoop. To provide accurate simulation results, 
our simulator models most entities of the involved components 
such as machines and chunks. When performing repair jobs, 
the simulator keeps track of the details of each repair process 
which gives us a detailed performance analysis. 

B. Simulator Validation 

We first calibrated our simulator to accurately model the 
data repair behavior of Hadoop. During the validation, we ran 
one experiment on a real Hadoop system. This system contains 
16 machines, which are connected by a 1Gb/s network.
Each machine has about 410GB data, namely approximately 
6400 chunks. Then, we manually failed one machine, and let 
Hadoop repair the lost data. After the repair was completed, we 
analyzed the log file of Hadoop and derived repair time of each 
chunk. Next, we ran a similar experiment in our simulator. 
We also collected the repair time of each chunk from the 
simulation. We present the CDF of the repair time of both 
experiments in Fig. 6.
Fig. 6. CDF of repair time

Fig. 6 shows that the repair result of the simulation matches
the results of the real Hadoop system very well, particularly 
when the percentile is below 95. Therefore, we conclude that 
the simulator can precisely simulate the data repair process of 
Hadoop. 




C. Storage Cost Analysis 

Now we observe how storage overhead varies when we grow 
(n,k). We compare three codes: 3-way replication, Reed- 
Solomon (RS) codes, and SRC. To make the storage overhead 
easily understood, we define the cost of storing one byte as 
the metric of how many bytes are stored for each useful byte. 
Obviously, high cost results in high storage overhead. As 3-way
replication is a popularly used approach, we use it as the
baseline for comparison. The result is presented in Fig. 7.
Fig. 7. Storage cost comparison



Fig. 7 shows that when n - k is fixed, the normalized cost
of both the RS-code and the SRC decreases as n grows. When
(n, k) grows to (50, 46), the normalized cost of SRC is 0.54,
and that of RS-code is 0.36. In other words, (50, 46, 2) SRCs
need approximately half the storage of 3-way replication. It is
worth noting that the cost of SRCs will further reduce if we
use larger values of f, but at the cost of slower repair.
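
The two normalized costs quoted above follow directly from the coding rates. The sketch below assumes that the cost of storing one byte is the inverse of the coding rate, n/k for an RS code and (f + 1)n/(fk) for an SRC, and that normalization divides by the cost 3 of 3-way replication; `normalized_cost` is a hypothetical helper name.

```python
def normalized_cost(n: int, k: int, f=None, replicas: int = 3) -> float:
    """Bytes stored per useful byte, relative to `replicas`-way replication.

    f=None models an (n, k) RS code; otherwise an (n, k, f)-SRC.
    """
    raw = n / k if f is None else (f + 1) * n / (f * k)
    return raw / replicas


print(round(normalized_cost(50, 46, f=2), 2))  # SRC
print(round(normalized_cost(50, 46), 2))       # RS
```

Raising f in this formula moves the SRC cost toward the RS cost, which is the storage side of the tradeoff against repair speed.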

D. Repair Performance 

In this experiment, we measure the throughput of repairing 
one failed data server. The experiment involves a total of 100 
machines, each storing 410GB of data. We fail at random one 
machine and start the data repair process. After the repair 
is finished, we measure the elapsed time and calculate the 
repair throughput. The results are shown in Fig. 8. Note that
the throughput of using 3-way replication is constant across 
different (n, k) since there is no such dependency on these 
parameters. 

From Fig. 8 we can make two observations. First, 3-
way replication has the best repair performance followed by 
SRC, while the RS-code offers the worst performance. This 
is not surprising due to the amount of data that has to be 
accessed for the repair. Second, the repair performance of 
SRC remains constant on various (n, k), but the performance 
of RS-code becomes much worse as n grows. This is one of 
the major benefits of SRC, i.e., the repair performance can be 
independent from (n,k). Furthermore, the repair throughput 
of SRC is about 500MB/s, approximately 64% of the 3-way 
replication's performance. 
Fig. 8. Repair performance comparison



E. Degraded Read Performance 

In a real system, repair can take place in two situations. 
One situation is when we need to repair a failed data storage 
server. Another situation is when we wish to read a piece 
of data, but it is stored in a storage server that is currently 
unavailable. The two situations differ in whether the repaired 
data is stored or not. The first situation is a regular repair 
operation, which writes the repaired data back to the system. 
The second situation repairs the data in the main memory and 
then simply drops it after serving the read request. We call 
the latter degraded read. The degraded read performance is 
important, since clients can notice performance degradations 
when servers have temporary or permanent failures. 

We use a similar experimental environment to what we 



presented in section IV-D The only difference is after a chunk 
is repaired, we do not write it back. The performance results 
are presented in Fig. [9] 
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Fig. 9. Degraded read performance comparison 

We can also make two observations from Fig. 9. First, 
for all three codes, the trend of degraded read 
performance is similar to that of repair performance, shown in 
Fig. 8. Second, for a code with the same (n, k), the degraded 
read performance is higher than the repair performance, 
because less data is accessed. Again, SRC achieves approximately 
60% of the degraded read performance of 3-way replication. 

F. Data Reliability Analysis 

Now we analyze the data reliability of an SRC cloud storage 
system. We use a simple Markov model [26] to estimate the 
reliability. For simplicity, failures happen only to disks and 
we assume no failure correlations. We note that we expect 
correlated failures to further benefit SRCs over replication 
since they spread the data to more nodes and hence achieve 
better diversity protection under correlated failure scenarios. 
This, however, remains to be verified in a more thorough 
experimental study of coded cloud storage systems. 

We assume that the mean time to failure (MTTF) of a disk 
is 5 years and the system stores 1PB data. To be conservative, 
the repair time is 15 minutes when using 3-way replication 
and 30 minutes for SRC, which is in accordance with Fig. 8. In 
the case of RS-code, the repair time depends on k of (n,k). 
With these parameters, we first measure the reliability of one 
redundancy set, and then use it to derive the reliability of 
the entire system. The estimated MTTF of the entire storage 
system is presented in Fig. 10. 

Fig. 10 shows that the data reliability of 3-way replication 
is on the order of 10^9. This is consistent with the 
results in [26]. We can observe that the reliability of SRCs 
is much higher than 3-way replication. Even for the high rate 
(low storage overhead) (50, 46) case, SRCs are several orders 
of magnitude more reliable than 3-way replication. This 
results from the high repair speed of SRCs. RS codes show a 
significantly different trend. Although the reliability of the (10, 6) 
and (20, 16) RS codes is higher than that of 3-way replication, the reliability of 
the RS-code drops sharply as (n, k) grows. This happens 
because its repair performance decreases rapidly as k grows. 
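
To make the role of the repair time concrete, the following is a minimal sketch of a birth-death Markov chain for one redundancy set. It is not the model of [26] and is not expected to reproduce the values of Fig. 10. It assumes independent exponential disk failures, independent exponential repairs running in parallel (one per failed disk), and data loss as soon as more failures are outstanding than the set tolerates. The function name `set_mttf` and its parameters are hypothetical.

```python
def set_mttf(n, tolerated, disk_mttf_hours, repair_hours):
    """Mean time (hours) until a redundancy set of n disks loses data.

    State i = number of failed disks; data is lost at i = tolerated + 1.
    Assumed rates: failures at (n - i) * lam, repairs at i * mu.
    """
    lam = 1.0 / disk_mttf_hours
    mu = 1.0 / repair_hours
    total = 0.0
    step = 0.0  # expected time to first move from state i-1 to state i
    for i in range(tolerated + 1):
        fail_rate = (n - i) * lam
        repair_rate = i * mu
        # First-passage recursion for a birth-death chain:
        #   h_i = 1/fail_i + (repair_i / fail_i) * h_{i-1}
        step = 1.0 / fail_rate + (repair_rate / fail_rate) * step
        total += step
    return total


HOURS_PER_YEAR = 24 * 365
disk_mttf = 5 * HOURS_PER_YEAR  # 5-year disk MTTF, as in the text

# 3-way replication with 15-minute repairs vs. a set that tolerates
# n - k = 4 failures across 50 disks with 30-minute repairs.
replication = set_mttf(3, 2, disk_mttf, repair_hours=0.25)
coded = set_mttf(50, 4, disk_mttf, repair_hours=0.5)
print(f"replication set MTTF: {replication / HOURS_PER_YEAR:.3e} years")
print(f"coded set MTTF:       {coded / HOURS_PER_YEAR:.3e} years")
```

In this sketch each additional tolerated failure multiplies the set MTTF by a factor proportional to the repair rate, so slow repairs (as with RS-codes at large k) erode reliability quickly. Going from one set to the whole system requires a further assumption, for example dividing by the number of independent redundancy sets.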



V. Conclusions 

We introduced a novel family of distributed storage codes 
that are formed by combining MDS codes and simple locally 
decodable parities for efficient repair and high fault tolerance. 
We show theoretically that our codes have the (n, k) reliability, 
have asymptotically optimal storage, and are within a 
logarithmic factor of optimality in repair bandwidth. One 
very significant benefit is that the number of nodes that need 
to be contacted for repair can be made a small constant, inde- 
pendent of n, k. Further, SRCs can be easily implemented by 
combining any prior MDS code implementation with XORing 
of coded chunks and the appropriate chunk placement into 
nodes. 

We presented a comparison of the proposed codes with 
replication and Reed-Solomon codes using a cloud storage 
simulator. We are interested in relatively large values of (n, k) 
because, when we keep n - k constant, larger values of k 
impose lower storage overhead (higher code rates). Standard 
Reed-Solomon codes cannot operate in this regime since their 
repair cost increases linearly in k. On the contrary, SRCs 
require only a constant number of nodes involved in each 
repair and can therefore achieve very good storage overhead 
with good performance. As an example, if we compare a 
(50, 46, 2) SRC with 3-way replication, we find that the SRC 
requires approximately half the storage but achieves only approximately 
60% of the degraded read performance of replication. The main strength 
of the SRC in this comparison, however, is that it provides 
approximately four orders of magnitude higher data reliability than 
replication. In the comparison with Reed-Solomon codes, SRCs almost 
certainly win when slightly more storage is 
allowed. 
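
The storage figures in this comparison can be checked with a short calculation. The sketch below assumes that an (n, k) RS-code has storage overhead n/k and that an (n, k, f) SRC has rate f/(f+1) * k/n, i.e., overhead (f+1)/f * n/k; the SRC expression is taken from the code construction and is stated here as an assumption because this section does not restate it. The function names are hypothetical.

```python
def rs_overhead(n, k):
    """Stored bytes per byte of user data for an (n, k) RS-code."""
    return n / k


def src_overhead(n, k, f):
    """Stored bytes per byte of user data for an (n, k, f) SRC.

    Assumed rate: f/(f+1) * k/n, so overhead is the reciprocal.
    """
    return (f + 1) / f * n / k


REPLICATION_OVERHEAD = 3.0  # 3-way replication stores three full copies

for n, k in [(10, 6), (20, 16), (50, 46)]:
    rs = rs_overhead(n, k)
    src = src_overhead(n, k, f=2)
    print(f"({n}, {k}): RS {rs:.2f}x, SRC(f=2) {src:.2f}x, "
          f"SRC/replication = {src / REPLICATION_OVERHEAD:.2f}")
```

For the (50, 46, 2) SRC this gives an overhead of about 1.63x, roughly 54% of the 3x overhead of replication, in line with "approximately half the storage". It also shows the overhead falling as k grows with n - k fixed at 4.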

In conclusion, we think that SRCs add new feasible points 
in the tradeoff space of distributed storage codes. They deliver 
comparable performance to 3-way replication and significantly 
higher data reliability at a lower storage cost. Our preliminary 
investigation therefore suggests that SRCs should be attractive 
for real cloud storage systems. 